Source Language Markers in EUROPARL Translations
Author
Abstract
This paper shows that the source language of medium-length speeches in the EUROPARL corpus can very often be identified from frequency counts of word n-grams (87.2% to 96.7% accuracy, depending on the classification method). It also examines in detail which positive markers are most powerful and identifies a number of linguistic aspects as well as culture- and domain-related ones.
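As a rough illustration of classification from word n-gram frequency counts, the sketch below counts word 1- to 3-grams per source language and picks the label whose smoothed log-frequencies best fit a new text. The abstract does not say which classification methods the paper uses, so the add-one smoothed scoring here is an assumption, and the function names (`word_ngrams`, `train`, `classify`) and the toy sentences are hypothetical, not EUROPARL data.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=3):
    """Return all word n-grams of length 1..n_max from lowercased text."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def train(labelled_texts, n_max=3):
    """Count n-gram frequencies separately for each source-language label."""
    counts = {}
    for label, text in labelled_texts:
        counts.setdefault(label, Counter()).update(word_ngrams(text, n_max))
    return counts


def classify(text, counts, n_max=3):
    """Pick the label whose add-one smoothed n-gram log-frequencies fit best."""
    vocab = set()
    for counter in counts.values():
        vocab.update(counter)

    def score(label):
        counter = counts[label]
        total = sum(counter.values()) + len(vocab) + 1
        return sum(
            math.log((counter[gram] + 1) / total)
            for gram in word_ngrams(text, n_max)
        )

    return max(counts, key=score)


# Invented toy snippets standing in for translated speeches (assumption).
TOY_DATA = [
    ("fr", "we must indeed take into account the question of the budget"),
    ("fr", "it is indeed necessary to take into account this question"),
    ("de", "i would like to say that we have to make this clear"),
    ("de", "i would like to make clear that we have to act"),
]

model = train(TOY_DATA)
prediction = classify("i would like to say that we have to act", model)
print(prediction)
```

On real data, the per-language counters would also give a starting point for inspecting which n-grams act as the strongest markers for each source language.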
Similar resources
Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an ex...
Syntax Augmented Machine Translation via Chart Parsing with Integrated Language Modeling
We present a hierarchical phrase-based translation model which annotates and generalizes existing phrase translations with syntactic categories derived from parsing the target side of a parallel corpus. We associate target parse trees for each training sentence pair with a search lattice constructed from the existing phrase translations on the corresponding source sentence, and consider techniq...
The Use of Parallel and Comparable Data for Analysis of Abstract Anaphora in German and English
Parallel corpora — original texts aligned with their translations — are a widely used resource in computational linguistics. Translation studies have shown that translated texts often differ systematically from comparable original texts. Translators tend to be faithful to structures of the original texts, resulting in a “shining through” of the original language preferences in the translated te...
A Comparative Study of Discourse Markers: The Case of three English Applied Linguistic Texts with their Farsi Translations
This research was an attempt to find the relationship between English discourse markers (DMs) and their Farsi translations. It was conducted to find out whether DM translations fully reflect the orientation of the source texts and to what extent DM translations are functionally appropriate compared to the original text. Six instruments were used. Three of them were the original English books...